An R example: ashr benchmark

This is a more advanced application of DSC with R scripts. This tutorial demonstrates several DSC2 features, including:

  • Inline code as input parameters
  • Aliases for executables, parameters and return values
  • R library installation and version check
  • Extracting results via combinations of tags

DSC Problem

The DSC problem is based on the ASH example of DSCR (R Markdown version and HTML version). Material to run this tutorial can be found in the DSC2 vignettes repo. The description below is copied from the DSCR vignette:

To illustrate we consider the problem of shrinkage, which is tackled by the ashr package at http://www.github.com/stephens999/ashr. The input to this DSC is a set of estimates $\hat\beta$, with associated standard errors $s$. These values are estimates of actual (true) values for $\beta$, so the meta-data in this case are the true values of beta. Methods must take $\hat\beta$ and $s$ as input, and provide as output "shrunk" estimates for $\beta$ (so output is a list with one element, called beta_est, which is a vector of estimates for beta). The score function then scores methods on their RMSE comparing beta_est with beta.

First define a datamaker which simulates true values of $\beta$ from a user-specified normal mixture, where one of the components is a point mass at 0 of mass $\pi_0$, which is a user-specified parameter. It then simulates $\hat\beta \sim N(\beta_j,s_j)$ (where $s_j$ is again user-specified). It returns the true $\beta$ values and true $\pi_0$ value as meta-data, and the estimates $\hat\beta$ and $s$ as input-data.
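The description above can be sketched in R. The following is a minimal, hypothetical version of what datamaker.R might contain, not the actual vignette code; the variable names (args, meta, input) follow the conventions used later in this tutorial:

```r
# Hypothetical sketch of datamaker.R; `args` is the list DSC2 assembles
# from the block parameters (see the .alias discussion below).
datamaker <- function(args) {
  # pi0: proportion of true zeros, drawn uniformly within user bounds
  pi0  <- runif(1, args$min_pi0, args$max_pi0)
  null <- rbinom(args$nsamp, 1, pi0)
  # non-null effects come from the normalmix object g (fields pi, mean, sd)
  k    <- sample(length(args$g$pi), args$nsamp, replace = TRUE, prob = args$g$pi)
  beta <- (1 - null) * rnorm(args$nsamp, args$g$mean[k], args$g$sd[k])
  # observed estimates: betahat ~ N(beta, betahatsd^2)
  sebetahat <- rep(args$betahatsd, args$nsamp)
  betahat   <- rnorm(args$nsamp, beta, sebetahat)
  list(meta  = list(beta = beta, pi0 = pi0),
       input = list(betahat = betahat, sebetahat = sebetahat))
}
```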

Now define a method wrapper for the ash function from the ashr package. Notice that this wrapper does not return output in the required format - it simply returns the entire ash output.
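A minimal sketch of such a wrapper, under the variable names used in the shrink block below (input holds $data from the simulate step, mixcompdist is the method parameter); this is illustrative, not the actual runash.R:

```r
# Hypothetical sketch of runash.R: run ash on the simulated estimates
# and return the entire ash output as ash_data.
library(ashr)
ash_data <- ash(betahat    = input$input$betahat,
                sebetahat  = input$input$sebetahat,
                mixcompdist = mixcompdist)
```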

Finally add a generic (can be used to deal with both $\pi$ and $\beta$) score function to evaluate estimates by ash.
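A one-line sketch of what such a generic score.R could be, assuming the variable names est and truth created by the parameter .alias in the score blocks below:

```r
# Hypothetical sketch of score.R: RMSE between estimates and truth.
result <- sqrt(mean((est - truth)^2))
```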

DSC Specification

The problem is fully specified in DSC2 language below, following the structure of the original DSCR implementation:

simulate:
      exec: datamaker.R
      seed: R(1:5)
      params:
          g: Asis(ashr::normalmix(c(2/3,1/3),c(0,0),c(1,2))),
             Asis(ashr::normalmix(rep(1/7,7),
             c(-1.5,-1,-0.5,0,0.5,1,1.5),rep(0.5,7))),
             Asis(ashr::normalmix(c(1/4,1/4,1/3,1/6),
             c(-2,-1,0,1),c(2,1.5,1,1)))
          min_pi0: 0
          max_pi0: 1
          nsamp: 1000
          betahatsd: 1
          .alias: args = List()
      return: data, true_beta = R(data$meta$beta), 
              true_pi0 = R(data$meta$pi0)

shrink:
      exec: runash.R
      params:
          input: $data
          mixcompdist: normal, halfuniform
      return: ash_data, beta_est = R(ashr::get_pm(ash_data)),
              pi0_est = R(ashr::get_pi0(ash_data))

beta_score:
      exec: score.R
      .alias: score_beta
      params:
          beta_true: $true_beta
          beta_est: $beta_est
          .alias: est = beta_est, truth = beta_true
      return: result

pi0_score:
      exec: score.R
      .alias: score_pi0
      params:
          pi0_est: $pi0_est
          pi0: $true_pi0
          .alias: est = pi0_est, truth = pi0
      return: result

DSC:
      run: simulate *
           shrink *
           (beta_score, pi0_score)
      R_libs: stephens999/ashr (2.0.0+)
      exec_path: bin
      output: dsc_result

This is a more complicated example. We suggest walking through every DSC block while cross-referencing the corresponding R code for the datamaker, method wrapper and score function, to see how DSC2 communicates with your R programs.

Next we walk through each DSC block and highlight some features.

Block simulate

Inline R code as parameters

The parameter g has three candidate values, all of which are R code wrapped in the Asis() operator. Contents inside Asis() are interpreted as executable code rather than strings. In other words, DSC2 will evaluate g <- ashr::normalmix(c(2/3,1/3),c(0,0),c(1,2)), so that g is assigned the output of the R code inside Asis() for use with datamaker.R. Without Asis(), this line would be interpreted as a string assigned to g, which would cause trouble.
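To see the difference concretely, here is what the evaluated expression produces (assuming the ashr package is installed):

```r
# Illustration: the Asis() content is executed as R code, so g becomes a
# normalmix object with fields pi, mean and sd, not the string
# "ashr::normalmix(...)".
g <- ashr::normalmix(c(2/3, 1/3), c(0, 0), c(1, 2))
str(g)
```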

Parameter alias for R list

Inside datamaker.R, the core function takes a single argument: an R list containing all parameters specified in this block. The .alias entry uses a special DSC2 operation, List(), to consolidate these parameters into an R list named args, which corresponds to that input parameter of datamaker.R.
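For one parameter combination, the .alias: args = List() entry is effectively equivalent to DSC2 preparing the following list before calling datamaker.R (illustrative sketch):

```r
# What DSC2 effectively assembles under .alias: args = List(),
# shown here for the first candidate value of g.
args <- list(g         = ashr::normalmix(c(2/3, 1/3), c(0, 0), c(1, 2)),
             min_pi0   = 0,
             max_pi0   = 1,
             nsamp     = 1000,
             betahatsd = 1)
```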

Return alias to extract output

The return object is data, consistent with the code in datamaker.R. However, we want to extract some variables from data for use in other steps. This is achieved by return aliases, which create new variables from existing values. The R() operator is used here to extract information from existing objects using R syntax.

Block shrink

Note the return alias pi0_est, which calls the get_pi0 function inside the R() operator to extract information from the existing output ash_data.

Blocks beta_score & pi0_score

These two blocks use the same computational routine, score.R, but on different input data. Adjustments are made via .alias to distinguish these blocks in the DSC output and to match the input variable names expected by score.R. The executable .alias renames the computational routine from the generic score.R to score_beta and score_pi0 respectively. These routine names become part of the column names in the DSC output database and must therefore be distinct. The parameter .alias converts the input variable names to the variable names used in score.R. It is possible to use those names directly, e.g.,

beta_score:
        ...
        params:
            truth: $true_beta
            est: $beta_est
        ...

and the DSC will also work. Here, for clarity and to demonstrate parameter aliases, we use informative names as parameter names in the DSC and convert them to what the R code requires via .alias.

Note also that, unlike the DSCR ASH example, the output score is a simple value (a float, the RMSE). If the outcome of the R code is not a simple object, for example if it returns a list variable score_output, then you may want to use a return alias to extract the important information into simple values so that they are readily extractable down the line, e.g., return: score_output, mse = R(score_output$mse).

DSC section

The DSC::run entry executes two sequences, which we have discussed in previous tutorials. The R_libs entry specifies the R package required by the DSC. It is formatted as a GitHub package (repo/pkg) with a minimal version requirement of 2.0.0. DSC first checks whether the package is available and installs it if necessary. It then checks the package version and quits with an error if the requirement is not satisfied; DSC does not attempt to upgrade a package on a version mismatch.
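A rough R equivalent of the check implied by R_libs might look like the following. This is a sketch of the behavior described above, not DSC's actual implementation, and it assumes devtools is available for the GitHub install:

```r
# Install the GitHub package (repo/pkg form) if missing, then verify the
# minimal version requirement; stop on mismatch rather than upgrading.
if (!requireNamespace("ashr", quietly = TRUE)) {
  devtools::install_github("stephens999/ashr")
}
if (utils::packageVersion("ashr") < "2.0.0") {
  stop("ashr version does not satisfy requirement 2.0.0+")
}
```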

Execution logic

This diagram (generated by dot command using the execution graph from this DSC) shows the logic of this benchmark:

[Figure ash.png: execution graph of this benchmark]

Run DSC

In [1]:
! dsc -x settings.dsc -j 8 --seeds "R(1:50)" -f
INFO: Checking R library stephens999/ashr ...
INFO: DSC script exported to dsc_result.html
INFO: Constructing DSC from settings.dsc ...
INFO: Building execution graph ...
Running core_shrink_1 (runash) (00:02:00): 

Running beta_score (00:01:00): 
Running core_pi0_score_1 (score_pi0) (00:01:00): 
Running pi0_score (00:01:00): 
DSC: 100%|██████████| 5/5 [04:10<00:00, 30.50s/it]
INFO: Writing output metadata ...
INFO: DSC complete!
INFO: Elapsed time 252.567 seconds.

Result annotation

The DSCR ASH example assigns names to the various simulation settings and methods. Here we use a DSC annotation file to reproduce that aspect of the DSCR example. We create a settings.ann file as follows:

An:
  simulate:
    g: Asis(ashr::normalmix(c(2/3,1/3),c(0,0),c(1,2)))

Bn:
  simulate:
    g: Asis(ashr::normalmix(rep(1/7,7),c(-1.5,-1,-0.5,0,0.5,1,1.5),rep(0.5,7)))

Cn:
  simulate:
    g: Asis(ashr::normalmix(c(1/4,1/4,1/3,1/6),c(-2,-1,0,1),c(2,1.5,1,1)))

ash_n:
  shrink:
    mixcompdist: normal

ash_nu:
  shrink:
    mixcompdist: halfuniform

and we apply this annotation to our benchmark:

In [2]:
! dsc -a settings.ann
INFO: Annotation summary for sequence ending with beta_score
+----------+---------------------------------------------+
|  Tag     |  No. unique obj.                            |
+----------+---------------------------------------------+
|  An      |  100 beta_score & 100 shrink & 50 simulate  |
|  Bn      |  100 beta_score & 100 shrink & 50 simulate  |
|  Cn      |  100 beta_score & 100 shrink & 50 simulate  |
|  ash_n   |  150 beta_score & 150 shrink & 150 simulate |
|  ash_nu  |  150 beta_score & 150 shrink & 150 simulate |
+----------+---------------------------------------------+
INFO: Annotation summary for sequence ending with pi0_score
+----------+---------------------------------------------+
|  Tag     |  No. unique obj.                            |
+----------+---------------------------------------------+
|  An      |  100 pi0_score & 100 shrink & 50 simulate   |
|  Bn      |  100 pi0_score & 100 shrink & 50 simulate   |
|  Cn      |  100 pi0_score & 100 shrink & 50 simulate   |
|  ash_n   |  150 pi0_score & 150 shrink & 150 simulate  |
|  ash_nu  |  150 pi0_score & 150 shrink & 150 simulate  |
+----------+---------------------------------------------+
INFO: Elapsed time 0.139 seconds.

Result extraction

Obtain final score for methods comparison

FIXME: add link to shinydsc

Suppose we are interested in the performance of methods ash_n and ash_nu in estimating $\pi_0$ under simulation setting An. We extract the data using the tags we created and write it to the file ashr_pi0_1.rds:

In [3]:
! dsc -e pi0_score:result --target pi0_score -o ashr_pi0_1.rds \
    --tags "case1 = An && ash_n" "case2 = An && ash_nu"
Extracting: 100%|██████████| 3/3 [00:00<00:00,  5.66it/s]
INFO: Data extracted to ashr_pi0_1.rds for DSC result pi0_score via annotations: 
	case1 = An && ash_n
	case2 = An && ash_nu
INFO: Elapsed time 0.981 seconds.

We can examine the result in R, similar to what we have done in the Quick Start example:

In [4]:
%use ir
dat = readRDS('ashr_pi0_1.rds')
case1 = unlist(dat$case1_pi0_score_result)
case2 = unlist(dat$case2_pi0_score_result)
print(c(mean(case1), mean(case2)))
[1] 0.1763984 0.2004572
In [5]:
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "MSE for pi_0 estimate")
htmlwidgets::saveWidget(as.widget(p), "pi0_score.html")
IRdisplay::display_html(paste(readLines("pi0_score.html"), collapse="\n"))

Obtain intermediate output

You can also extract quantities of interest from any step in a DSC sequence. For example, suppose we want to compare MSE for the posterior mean estimate and, at the same time, explore the distribution of the posterior mean. We first extract both quantities:

In [6]:
! dsc -e beta_score:result shrink:beta_est --target beta_score -o ashr_beta_1.rds \
    --tags "case1 = An && ash_n" "case2 = An && ash_nu"
Extracting: 100%|██████████| 5/5 [00:01<00:00,  4.75it/s]
INFO: Data extracted to ashr_beta_1.rds for DSC result beta_score via annotations: 
	case1 = An && ash_n
	case2 = An && ash_nu
INFO: Elapsed time 2.085 seconds.

Then we plot them both:

In [7]:
%use ir
dat = readRDS('ashr_beta_1.rds')
case1 = unlist(dat$case1_beta_score_result)
case2 = unlist(dat$case2_beta_score_result)
case1_beta = rowMeans(data.frame(dat$case1_shrink_beta_est))
case2_beta = rowMeans(data.frame(dat$case2_shrink_beta_est))
#
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "MSE for beta estimate")
htmlwidgets::saveWidget(as.widget(p), "beta_score.html")
IRdisplay::display_html(paste(readLines("beta_score.html"), collapse="\n"))
In [8]:
p = plot_ly(x = case1_beta, name = 'case 1', opacity = 0.9, type = "histogram") %>%
  add_trace(x = case2_beta, name = 'case 2', opacity = 0.9, type = "histogram") %>%
  layout(title = "Posterior mean distribution")
htmlwidgets::saveWidget(as.widget(p), "beta.html")
IRdisplay::display_html(paste(readLines("beta.html"), collapse="\n"))

Benchmarking runtime

You can also benchmark the time it takes to run a computational step. For example:

In [10]:
case1 = unlist(dat$DSC_TIMER$case1_shrink_beta_est)
case2 = unlist(dat$DSC_TIMER$case2_shrink_beta_est)
#
suppressMessages(library(plotly))
p = plot_ly(y = case1, name = 'case 1', type = "box") %>%
  add_trace(y = case2, name = 'case 2', type = "box")  %>% 
  layout(title = "Timer elapsed for posterior mean estimation")
htmlwidgets::saveWidget(as.widget(p), "beta_time.html")
IRdisplay::display_html(paste(readLines("beta_time.html"), collapse="\n"))